Mini-Tutorial 1: Data Visualization Concepts

Introduction

“All work presented is my own. I have not communicated with or worked with anyone else on this exam.”

In this mini-tutorial I will introduce the concepts behind the Grammar of Graphics and then dive into data visualization concepts from Data Visualization: A Practical Introduction. Working through these sections will show you the foundations of data visualization and why they are so important to understand when working with data.

Grammar of Graphics

The Grammar of Graphics includes these main concepts: Data, Geom, Mapping, and Faceting. These concepts are important because they are the tools that data visualists use to create plots and other graphics. They let you build and customize a plot to meet the needs of a project and help your audience understand and analyze the data.
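To see how these four pieces fit together before looking at each one, here is a minimal sketch of a single plot that uses all of them. It uses the mpg data set that ships with ggplot2 as a stand-in (an assumption made so the example runs on its own; it is not the NFL data used in the rest of this tutorial).

```r
library(ggplot2)

# Data: mpg is a stand-in data set bundled with ggplot2
p <- ggplot(data = mpg,
            # Mapping: which variables drive which visual properties
            aes(x = displ, y = hwy)) +
  # Geom: the geometric object drawn for each row of the data
  geom_point() +
  # Faceting: one panel per value of a categorical variable
  facet_wrap(~ drv)

p
```

Each of the sections below swaps in a different choice for one of these four pieces.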

Data

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
standings_df <- read_csv("data/standings.csv")
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings_df
## # A tibble: 638 × 15
##    team   team_name  year  wins  loss points_for points_against points_differen…
##    <chr>  <chr>     <dbl> <dbl> <dbl>      <dbl>          <dbl>            <dbl>
##  1 Miami  Dolphins   2000    11     5        323            226               97
##  2 India… Colts      2000    10     6        429            326              103
##  3 New Y… Jets       2000     9     7        321            321                0
##  4 Buffa… Bills      2000     8     8        315            350              -35
##  5 New E… Patriots   2000     5    11        276            338              -62
##  6 Tenne… Titans     2000    13     3        346            191              155
##  7 Balti… Ravens     2000    12     4        333            165              168
##  8 Pitts… Steelers   2000     9     7        321            255               66
##  9 Jacks… Jaguars    2000     7     9        367            327               40
## 10 Cinci… Bengals    2000     4    12        185            359             -174
## # … with 628 more rows, and 7 more variables: margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>

This data set contains NFL standings from 2000 to 2019. Data is a major part of data visualization. It is a required input for any plot, but it is also important to understand the data that you are using before you make any graph.
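A few quick checks can help with understanding a data set before plotting it. The sketch below is one possible routine, shown on the mpg data set bundled with ggplot2 as a stand-in (an assumption, so the code runs without the standings file). The name na_per_col is just an illustrative variable.

```r
library(ggplot2)  # only for the bundled mpg data set
library(dplyr)

# Column names, types, and the first few values of each column
glimpse(mpg)

# Range and quartiles of a numeric variable
summary(mpg$hwy)

# Number of rows in each category of a categorical variable
drv_counts <- count(mpg, drv)
drv_counts

# Number of missing values in each column
na_per_col <- colSums(is.na(mpg))
na_per_col
```

Applied to the standings data, a check such as range(standings_df$year) would confirm which seasons the file covers.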

Geom

The geom is the geometric object that you use to represent the data. There are many different types of graphs that you could use, but it is important that the one you choose represents the data well. That is why you need to understand your data before you make these graphs.

NE_df <- standings_df %>% filter(team == "New England")
ggplot(data = NE_df, aes(x = year, y = wins)) +
  geom_col()

### This bar graph shows the number of New England Patriots wins in each season.

ggplot(data = NE_df, aes(x = year, y = points_for)) + 
  geom_point()

### This scatter plot shows the number of points the New England Patriots scored each season from 2000 to 2019.

Mapping

Mapping covers all of the aesthetics that are available in ggplot2 in R, that is, the links between variables in the data and visual properties of the plot. These include, but are not limited to, position (x and y), color, size, and shape. We have already seen some mapping in the examples above, but there is much more that you can do, as in these examples.
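One distinction worth keeping in mind is the difference between mapping an aesthetic to a variable (inside aes()) and setting it to a constant (outside aes()). The sketch below illustrates this with the mpg data set bundled with ggplot2, used here as a stand-in for the NFL data. The object names p_mapped and p_set are only illustrative.

```r
library(ggplot2)

# Mapping: colour depends on a variable, so it goes inside aes()
# and ggplot2 builds a legend for it
p_mapped <- ggplot(data = mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

# Setting: every point gets the same colour, so it goes outside aes()
# as an argument to the geom, and no legend is drawn
p_set <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "steelblue")

p_mapped
p_set
```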

ggplot(data = standings_df, aes(x = wins, colour = team, fill = team)) +
  geom_bar()

### As you can see, through this mapping I have been able to add color to represent each NFL team. However, I have made a graph that is really difficult to read, and an audience would struggle to understand what they are looking at. This allows me to introduce another important topic: Faceting.

Faceting

For this section I have cut the data set down to just the AFC East teams.

AFCEast_df <- standings_df %>% filter(team_name %in% c("Patriots", "Bills", "Jets", "Dolphins"))
AFCEast_df
## # A tibble: 80 × 15
##    team   team_name  year  wins  loss points_for points_against points_differen…
##    <chr>  <chr>     <dbl> <dbl> <dbl>      <dbl>          <dbl>            <dbl>
##  1 Miami  Dolphins   2000    11     5        323            226               97
##  2 New Y… Jets       2000     9     7        321            321                0
##  3 Buffa… Bills      2000     8     8        315            350              -35
##  4 New E… Patriots   2000     5    11        276            338              -62
##  5 New E… Patriots   2001    11     5        371            272               99
##  6 Miami  Dolphins   2001    11     5        344            290               54
##  7 New Y… Jets       2001    10     6        308            295               13
##  8 Buffa… Bills      2001     3    13        265            420             -155
##  9 New Y… Jets       2002     9     7        359            336               23
## 10 New E… Patriots   2002     9     7        381            346               35
## # … with 70 more rows, and 7 more variables: margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
ggplot(data = AFCEast_df, aes(x = year, y = wins, colour = team, fill = team)) +
  geom_col() +
  facet_wrap(~ team_name)

### Through faceting we can make a much more organized graph that gives the people looking at our analysis a better idea of what they are looking at. The graph in the prior section was very unorganized; the facet_wrap() function, however, allows us to see the win totals for these AFC East teams much more easily. Additionally, it is much easier for a viewer of this faceted plot to compare the number of wins between these four teams.
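Part of what makes panels easy to compare is that facet_wrap() gives every panel the same axes by default. The sketch below contrasts that default with scales = "free_y". It uses the mpg data set bundled with ggplot2 as a stand-in (an assumption, so the code runs on its own), and the object names are only illustrative.

```r
library(ggplot2)

# Stand-in data: mpg ships with ggplot2 (not the NFL standings)
base_plot <- ggplot(data = mpg, aes(x = class)) +
  geom_bar()

# Default: every panel shares the same axes, so bar heights can be
# compared directly from one panel to the next
p_shared <- base_plot + facet_wrap(~ drv)

# Free y-axes: each panel gets its own scale, which can reveal detail
# in small groups but makes cross-panel comparisons harder
p_free <- base_plot + facet_wrap(~ drv, scales = "free_y")

p_shared
p_free
```

For a comparison like the win totals above, the shared-axis default is usually the safer choice.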

Problems with Honesty and Good Judgement

As a data visualist you see graphs and representations of data on an everyday basis. Through this repetition you learn how to read graphs and can spot when things are off. Unfortunately, not everyone has the skills to accurately read or understand what a graph is showing. In this section we will use a Happy Planet Index data set to show some good practices that help people interpret graphs and data.

hpi_df <- read_csv("data/hpi-tidy.csv")
## Rows: 151 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country, GovernanceRank, Region
## dbl (8): HPIRank, LifeExpectancy, Wellbeing, HappyLifeYears, Footprint, Happ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
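hpi_LE_df <- hpi_df %>% group_by(Region) %>%
  summarise(mean_LE = mean(LifeExpectancy)) %>%
  select(Region, mean_LE) %>%
  arrange(desc(mean_LE)) %>%
  # the data is grouped by Region again here, so fct_reorder() below runs
  # within each one-row group and does not order the regions by mean_LE
  group_by(Region) %>%
  mutate(LEorder = fct_reorder(Region, mean_LE))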

ggplot(data = hpi_LE_df, aes(x = LEorder, y = mean_LE, fill = LEorder)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_d() +
  labs(x = "Region",
       y = "Mean Life Expectancy") +
  theme(legend.position = "none")

### We can see in this graph the mean life expectancy for each region in the data set. Although this graph is not necessarily misleading, the bars are not ordered by life expectancy, so for an audience without our training it could be tricky to interpret correctly.

hpi_LE_df <- hpi_df %>% group_by(Region) %>%
  summarise(mean_LE = mean(LifeExpectancy)) %>%
  arrange(desc(mean_LE)) %>%
  # make sure the data is ungrouped so that fct_reorder() orders all
  # regions together by their mean life expectancy
  ungroup() %>%
  mutate(LEorder = fct_reorder(Region, mean_LE))

ggplot(data = hpi_LE_df, aes(x = LEorder, y = mean_LE, fill = LEorder)) +
  geom_col() +
  coord_flip() +
  scale_fill_viridis_d() +
  labs(x = "Region",
       y = "Mean Life Expectancy") +
  theme(legend.position = "none")

### The difference in this graph is that I have put the regions in descending order based on life expectancy. This allows people to clearly see how the regions rank on mean life expectancy. Additionally, a major factor in helping the audience understand these bar charts is having a zero baseline. Notice that all of these bars start at zero. This makes the graph much easier to read because the viewer does not need to work out the starting point of each bar.
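To illustrate why the zero baseline matters, here is a small sketch with two made-up values (illustrative numbers only, not taken from the Happy Planet Index data). It compares the true ratio of the values with the ratio of the visible bar heights once the y-axis is cut off at 68. The names toy_df, true_ratio, and visible_ratio are hypothetical.

```r
library(ggplot2)

# Illustrative values only (not from the Happy Planet Index data)
toy_df <- data.frame(group = c("A", "B"), value = c(70, 75))

# Zero baseline: bar heights are proportional to the values
p_zero <- ggplot(data = toy_df, aes(x = group, y = value)) +
  geom_col()

# Truncated axis: zooming in on 68 to 76 hides most of each bar
p_truncated <- p_zero +
  coord_cartesian(ylim = c(68, 76), expand = FALSE)

# Ratio of the actual values versus ratio of the visible bar heights
true_ratio <- 75 / 70                    # B is about 7% larger than A
visible_ratio <- (75 - 68) / (70 - 68)   # B looks 3.5 times as tall as A

p_zero
p_truncated
```

With the truncated axis a difference of roughly 7% is drawn as if one bar were 3.5 times the height of the other, which is the kind of misreading a zero baseline avoids.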

Good Data

Having good data that is representative of the topic you are trying to cover is extremely important when showing your charts and graphs to other people. There may be many times when missing data makes only very small changes to your results, but there are other times when it can lead to massive changes in your analysis. In this section we will use the Happy Planet Index once again to show some of the effects of not having good, representative data.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_full <- ggplot(data = hpi_df, aes(x = GDPcapita, y = Wellbeing, label = Country)) +
  geom_point() +
  geom_smooth()

ggplotly(plot_full, tooltip = "label")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
hpi_df
## # A tibble: 151 × 11
##    HPIRank Country     LifeExpectancy Wellbeing HappyLifeYears Footprint
##      <dbl> <chr>                <dbl>     <dbl>          <dbl>     <dbl>
##  1     109 Afghanistan           48.7      4.76           29.0     0.540
##  2      18 Albania               76.9      5.27           48.8     1.81 
##  3      26 Algeria               73.1      5.24           46.2     1.65 
##  4     127 Angola                51.1      4.21           28.2     0.891
##  5      17 Argentina             75.9      6.44           55.0     2.71 
##  6      53 Armenia               74.2      4.37           41.9     1.73 
##  7      76 Australia             81.9      7.41           65.5     6.68 
##  8      48 Austria               80.9      7.35           64.3     5.29 
##  9      80 Azerbaijan            70.7      4.22           39.1     1.97 
## 10     146 Bahrain               75.1      4.55           43.5     6.65 
## # … with 141 more rows, and 5 more variables: HappyPlanetIndex <dbl>,
## #   Population <dbl>, GDPcapita <dbl>, GovernanceRank <chr>, Region <chr>

In this graph with all the countries we can see a representative relationship between GDP per capita and well-being. However, if we take out some of the data, making it no longer representative, we see a much different story.

hpi_not_full_df <- hpi_df %>% slice(1:76)
hpi_not_full_df
## # A tibble: 76 × 11
##    HPIRank Country     LifeExpectancy Wellbeing HappyLifeYears Footprint
##      <dbl> <chr>                <dbl>     <dbl>          <dbl>     <dbl>
##  1     109 Afghanistan           48.7      4.76           29.0     0.540
##  2      18 Albania               76.9      5.27           48.8     1.81 
##  3      26 Algeria               73.1      5.24           46.2     1.65 
##  4     127 Angola                51.1      4.21           28.2     0.891
##  5      17 Argentina             75.9      6.44           55.0     2.71 
##  6      53 Armenia               74.2      4.37           41.9     1.73 
##  7      76 Australia             81.9      7.41           65.5     6.68 
##  8      48 Austria               80.9      7.35           64.3     5.29 
##  9      80 Azerbaijan            70.7      4.22           39.1     1.97 
## 10     146 Bahrain               75.1      4.55           43.5     6.65 
## # … with 66 more rows, and 5 more variables: HappyPlanetIndex <dbl>,
## #   Population <dbl>, GDPcapita <dbl>, GovernanceRank <chr>, Region <chr>
plot_not_full <- ggplot(data = hpi_not_full_df, aes(x = GDPcapita, y = Wellbeing, label = Country)) +
  geom_point() + 
  geom_smooth()

ggplotly(plot_not_full, tooltip = "label")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

In this plot half of the data set has been removed, and the resulting trend is much different from that of the full data set. This is a very extreme example, but it shows the importance of making sure that your data set is complete and representative. Viewers of this graph could easily interpret it as saying that “as GDP per capita increases, well-being increases.” However, in the full data set we can see that this interpretation is not necessarily correct.
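One reason a subset can fail to be representative is the way it was chosen. Taking the first rows of a sorted file, as slice(1:76) does above, keeps only whatever happens to come first, while a random sample draws from the whole file. The sketch below shows the contrast on simulated data (an assumption for illustration only; sim_df is not the Happy Planet Index data, and the object names are hypothetical).

```r
library(dplyr)

set.seed(1)  # make the random sample reproducible

# Simulated stand-in data, sorted by x the way a real file might be sorted
sim_df <- tibble(x = 1:100)

# Non-random subset: the first half of the rows
first_half <- sim_df %>% slice(1:50)

# Random subset of the same size
random_half <- sim_df %>% slice_sample(n = 50)

# The first half only covers the low end of x,
# while the random half is spread across the whole range
range(first_half$x)
range(random_half$x)
```

Even a random subset is still a subset, so the full data remains preferable when it is available.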